## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating from 0 (very bad) to 10 (very excellent).
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed acidity is positively skewed. The middle half of the wines have fixed acidity between 7.10 and 9.20.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity shows a bimodal, positively skewed distribution. The middle half of the wines have volatile acidity between 0.39 and 0.64.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.120 7.680 8.445 8.847 9.740 16.285
Total acidity combines fixed and volatile acidity. Its distribution is positively skewed, with a median of 8.445.
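As a minimal sketch of how such a derived variable can be built, the snippet below assumes total acidity is the plain sum of the two columns (the report does not show the exact code). The data frame name `wqr_head` is hypothetical and holds only the first ten rows shown in the `str()` output.

```r
# First ten rows of the two acidity columns, copied from the str() output.
# `wqr_head` is a hypothetical name used only for this illustration.
wqr_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Assumption: total acidity is the simple sum of the two components.
wqr_head$total.acidity <- wqr_head$fixed.acidity + wqr_head$volatile.acidity

print(wqr_head)
```

Because fixed acidity is roughly an order of magnitude larger than volatile acidity, a sum like this is dominated by the fixed component, which fits the 0.99 correlation between the two reported later.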
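Residual sugar is concentrated at low values with a long right tail (median 2.2, maximum 15.5).
Chlorides show the same shape: concentrated at low values with a long right tail (median 0.079, maximum 0.611).
Total sulfur dioxide has some high outliers: the maximum is 289, while the third quartile is only 62.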
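One way to make the long tails and outliers concrete is Tukey's rule of thumb, which flags values above Q3 + 1.5 × IQR. This is a sketch using only the quartiles and maxima from the `summary()` output above; the report itself does not say which outlier criterion, if any, was used.

```r
# Quartiles and maxima copied from the summary() output.
q <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide"),
  q1  = c(1.9, 0.070, 22),
  q3  = c(2.6, 0.090, 62),
  max = c(15.5, 0.611, 289)
)

# Tukey's upper fence: Q3 + 1.5 * IQR (an assumed criterion, not the author's).
q$upper.fence <- q$q3 + 1.5 * (q$q3 - q$q1)

# TRUE when the largest observation lies beyond the fence.
q$max.beyond.fence <- q$max > q$upper.fence

print(q)
```

For all three variables the maximum lies well beyond the fence, which is consistent with the long right tails described above.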
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Half of the wines have a density between 0.9956 and 0.9978.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Half of the wines have a pH between 3.210 and 3.400.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Most of the wines have a quality rating of 5 or 6.
There are 1,599 red wines in the dataset with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). X identifies each wine, and quality represents how good the wine is. Strictly, X is an unordered factor and quality is an ordered factor, but I treated both as numerical variables for convenience. All other features represent chemical properties of the wine.
Other observations:
The main feature in the data set is quality. I’d like to determine which features are best for predicting it. I suspect that some combination of the other variables can be used to build a predictive model for wine quality.
The primary wine characteristics are sweetness, acidity, tannin, alcohol, and body. Residual sugar, fixed and volatile acidity, alcohol, and density determine those characteristics, so I expect these variables to be the ones most closely related to wine quality.
I created a variable for total acidity from the volatile and fixed acidity.
Volatile acidity shows a bimodal distribution.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## total.acidity 0.99 -0.16 0.63
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.acidity 0.12 0.10 -0.16
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## total.acidity -0.11 0.68 -0.67 0.16 -0.08
## quality total.acidity
## fixed.acidity 0.12 0.99
## volatile.acidity -0.39 -0.16
## citric.acid 0.23 0.63
## residual.sugar 0.01 0.12
## chlorides -0.13 0.10
## free.sulfur.dioxide -0.05 -0.16
## total.sulfur.dioxide -0.19 -0.11
## density -0.17 0.68
## pH -0.06 -0.67
## sulphates 0.25 0.16
## alcohol 0.48 -0.08
## quality 1.00 0.09
## total.acidity 0.09 1.00
Fixed acidity has a strong positive correlation with citric acid (0.67), and volatile acidity has a fairly strong negative one (-0.55).
pH has a strong negative correlation with fixed acidity (-0.68) and citric acid (-0.54), but not with volatile acidity (0.23).
Fixed acidity and alcohol have substantial positive (0.67) and negative (-0.50) correlations with density, respectively.
Most of the variables do not seem to have strong correlations with quality, but alcohol and volatile acidity have moderate positive (0.48) and negative (-0.39) correlations with quality, respectively.
The strongest correlation among the original variables in this data set is between fixed acidity and pH (-0.68). High acidity means low pH, and the graph agrees with this fact.
Citric acid is one of the main components of fixed acidity. Therefore the two variables have a strong positive correlation.
Fixed acidity has a strong positive correlation with density, too.
Yeast in wine converts citric acid to acetic acid, which makes up most of the volatile acid. Therefore, volatile acidity and citric acid are inversely related.
Citric acid has moderate negative correlations with volatile acidity and pH.
Alcohol and density also show a moderate negative correlation.
Quality of wine tends to increase as volatile acidity decreases, because the main component of volatile acid is acetic acid which causes an unpleasant vinegar taste.
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = wqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79071 -0.54411 -0.00687 0.47350 2.93148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.56575 0.05791 113.39 <2e-16 ***
## volatile.acidity -1.76144 0.10389 -16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.152
## F-statistic: 287.4 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the R-squared value, volatile acidity explains only about 15.2% of the variance in wine quality.
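For a regression with a single predictor, R-squared is simply the squared Pearson correlation, so the figure above can be cross-checked against the correlation matrix. The sketch below uses only the rounded values printed in this report, so agreement is expected only to about two or three decimal places.

```r
# Correlation between quality and volatile.acidity, as printed (rounded)
# in the correlation matrix.
r <- -0.39

# Multiple R-squared reported by summary(lm(quality ~ volatile.acidity)).
r_squared_lm <- 0.1525

# In simple linear regression, R-squared equals r^2.
r_squared_from_cor <- r^2
print(r_squared_from_cor)

# The small gap comes from rounding r to two decimals.
print(abs(r_squared_from_cor - r_squared_lm))
```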
##
## Call:
## lm(formula = quality ~ I(sqrt(alcohol)), data = wqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8551 -0.4087 -0.1711 0.5115 2.5870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0237 0.3538 -5.72 1.27e-08 ***
## I(sqrt(alcohol)) 2.3756 0.1096 21.68 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7101 on 1597 degrees of freedom
## Multiple R-squared: 0.2274, Adjusted R-squared: 0.2269
## F-statistic: 469.9 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the R-squared value, the square root of alcohol explains about 22.7% of the variance in wine quality.
Residual sugar determines the sweetness of the wine. Most of the wines maintain a certain level of sweetness.
The quality correlates with alcohol and volatile acidity.
Citric acid is one of the main components of fixed acidity. As a result, they have a strong positive correlation.
High fixed acidity causes low pH. Therefore, fixed acidity and citric acid correlate negatively with pH.
Wine with more volatile acidity tends to have less citric acid.
Wine with more fixed acidity tends to be denser, whereas wine with more alcohol tends to be less dense.
Fixed acidity is strongly and positively correlated with citric acid and density. Citric acid might therefore stand in for fixed acidity and density in a model, and may even give a better estimate of wine quality.
c(cor(wqr$volatile.acidity, wqr$sulphates),
cor(wqr$volatile.acidity, log10(wqr$sulphates)))
## [1] -0.2609867 -0.3005487
Transforming sulphates to log10(sulphates) strengthens the negative correlation between sulphates and volatile acidity (from -0.261 to -0.301).
c(cor(wqr$alcohol, wqr$pH), cor(wqr$alcohol, wqr$pH^7))
## [1] 0.2056325 0.2287039
Transforming pH to pH^7 slightly increases the correlation between pH and alcohol (from 0.206 to 0.229). As shown below, this slightly improves the fit of our model.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides + total.sulfur.dioxide + pH, data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides + total.sulfur.dioxide + pH + citric.acid, data = wqr)
##
## ==========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## --------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.777*** 3.005*** 4.296*** 4.613***
## (0.175) (0.184) (0.196) (0.199) (0.204) (0.400) (0.461)
## alcohol 0.361*** 0.314*** 0.309*** 0.292*** 0.277*** 0.291*** 0.295***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.017) (0.017)
## volatile.acidity -1.384*** -1.221*** -1.167*** -1.142*** -1.038*** -1.115***
## (0.095) (0.097) (0.097) (0.097) (0.100) (0.115)
## sulphates 0.679*** 0.874*** 0.915*** 0.889*** 0.899***
## (0.101) (0.111) (0.110) (0.110) (0.110)
## chlorides -1.645*** -1.705*** -2.002*** -1.915***
## (0.394) (0.392) (0.398) (0.403)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001)
## pH -0.435*** -0.525***
## (0.116) (0.133)
## citric.acid -0.167
## (0.121)
## --------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.343 0.351 0.357 0.358
## adj. R-squared 0.226 0.316 0.335 0.341 0.349 0.355 0.355
## sigma 0.710 0.668 0.659 0.655 0.651 0.649 0.649
## F 468.267 370.379 268.912 208.125 172.683 147.427 126.712
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1590.682 -1580.383 -1573.351 -1572.389
## Deviance 805.870 711.796 692.105 684.612 675.850 669.931 669.126
## AIC 3448.114 3251.628 3208.768 3193.364 3174.767 3162.701 3162.778
## BIC 3464.245 3273.136 3235.654 3225.626 3212.407 3205.719 3211.173
## N 1599 1599 1599 1599 1599 1599 1599
## ==========================================================================================================================
The first linear model (m6) accounts for about 35.7% of the variance in quality. Less significant variables were removed.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)),
## data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide + I(pH^7), data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid,
## data = wqr)
##
## ==========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## --------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.369*** 3.742*** 3.998*** 4.003*** 4.099***
## (0.175) (0.184) (0.184) (0.201) (0.208) (0.207) (0.212)
## alcohol 0.361*** 0.314*** 0.303*** 0.285*** 0.270*** 0.289*** 0.295***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.017) (0.017)
## volatile.acidity -1.384*** -1.156*** -1.099*** -1.076*** -0.940*** -1.043***
## (0.095) (0.097) (0.098) (0.097) (0.101) (0.114)
## I(log10(sulphates)) 1.477*** 1.794*** 1.843*** 1.849*** 1.894***
## (0.177) (0.190) (0.189) (0.188) (0.190)
## chlorides -1.694*** -1.729*** -2.063*** -1.935***
## (0.383) (0.380) (0.385) (0.390)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001)
## I(pH^7) -0.000*** -0.000***
## (0.000) (0.000)
## citric.acid -0.228
## (0.118)
## --------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.345 0.353 0.361 0.370 0.371
## adj. R-squared 0.226 0.316 0.344 0.352 0.359 0.367 0.368
## sigma 0.710 0.668 0.654 0.650 0.646 0.642 0.642
## F 468.267 370.379 280.646 217.837 180.338 155.588 134.130
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1587.752 -1577.984 -1568.023 -1557.699 -1555.809
## Deviance 805.870 711.796 682.108 673.825 665.482 656.943 655.393
## AIC 3448.114 3251.628 3185.503 3167.967 3150.046 3131.397 3129.619
## BIC 3464.245 3273.136 3212.389 3200.230 3187.686 3174.414 3178.013
## N 1599 1599 1599 1599 1599 1599 1599
## ==========================================================================================================================
The variables in this linear model (m6) account for 37.0% of the variance in wine quality. Using log10(sulphates) and pH^7 improves on the 35.7% obtained without the transformations.
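The AIC and BIC rows of the comparison table can be reproduced from the log-likelihoods, which helps when deciding whether the citric acid term in m7 is worth keeping. The sketch below assumes the usual convention for `lm` fits, where the parameter count is the number of coefficients plus one for the residual variance.

```r
n <- 1599  # number of observations

# Log-likelihoods of m6 and m7 from the transformed-variable table.
loglik <- c(m6 = -1557.699, m7 = -1555.809)

# Parameter counts: intercept + slopes + residual variance (assumed convention).
k <- c(m6 = 1 + 6 + 1, m7 = 1 + 7 + 1)

aic <- -2 * loglik + 2 * k        # penalty of 2 per parameter
bic <- -2 * loglik + log(n) * k   # penalty of log(n) per parameter

print(round(rbind(aic, bic), 1))
print(c(best.by.aic = names(which.min(aic)),
        best.by.bic = names(which.min(bic))))
```

The two criteria disagree here: AIC marginally favors m7, while BIC, with its heavier penalty, favors m6. That matches the borderline p-value of the citric acid coefficient.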
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid,
## data = wqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.63753 -0.37786 -0.03801 0.44159 1.96876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.099e+00 2.123e-01 19.308 < 2e-16 ***
## alcohol 2.948e-01 1.708e-02 17.260 < 2e-16 ***
## volatile.acidity -1.043e+00 1.138e-01 -9.161 < 2e-16 ***
## I(log10(sulphates)) 1.894e+00 1.895e-01 9.995 < 2e-16 ***
## chlorides -1.935e+00 3.905e-01 -4.954 8.04e-07 ***
## total.sulfur.dioxide -2.207e-03 5.023e-04 -4.394 1.18e-05 ***
## I(pH^7) -6.244e-05 1.265e-05 -4.936 8.83e-07 ***
## citric.acid -2.281e-01 1.176e-01 -1.940 0.0525 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6418 on 1591 degrees of freedom
## Multiple R-squared: 0.3711, Adjusted R-squared: 0.3684
## F-statistic: 134.1 on 7 and 1591 DF, p-value: < 2.2e-16
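To see how the fitted coefficients translate into a prediction, the sketch below applies them by hand to the first wine in the dataset (values taken from the `str()` output, observed quality 5). The vector names `b` and `x` are hypothetical; with the fitted model object available, `predict()` would do the same job.

```r
# Coefficients copied from the summary() output of the final model.
b <- c(intercept            =  4.099,
       alcohol              =  0.2948,
       volatile.acidity     = -1.043,
       log10.sulphates      =  1.894,
       chlorides            = -1.935,
       total.sulfur.dioxide = -2.207e-03,
       pH7                  = -6.244e-05,
       citric.acid          = -0.2281)

# First wine from the str() output, with the same transformations applied.
x <- c(1,            # intercept term
       9.4,          # alcohol
       0.7,          # volatile.acidity
       log10(0.56),  # log10(sulphates)
       0.076,        # chlorides
       34,           # total.sulfur.dioxide
       3.51^7,       # pH^7
       0)            # citric.acid

# Linear predictor: sum of coefficient * value.
predicted <- sum(b * x)
print(round(predicted, 2))
```

The pH^7 coefficient looks like zero in the comparison table only because pH^7 is in the thousands; its contribution to the prediction is comparable to that of the other terms.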
Transforming sulphates and pH strengthens their correlations with other variables. These transformations give a clue for building a better linear model.
High alcohol and low volatile acidity seem to produce better wines.
I created a couple of linear models. Though the explanatory power (R-squared) of the model could be increased a bit by transforming a couple of variables, the final model is still not satisfactory. This may be because our dataset contains a small number of observations. Furthermore, most of the observations are mid-quality wines, which makes it difficult for the model to predict the edge cases. A supplementary dataset with more edge cases might help to predict wine quality more accurately.
Alcohol percentage plays a primary role in determining the quality of wines: the higher the alcohol percentage, the better the wine quality tends to be. But the R-squared value from our earlier linear model shows that alcohol alone explains only about 22% of the variance in wine quality, so alcohol is not the only factor responsible for better wine quality.
Volatile acidity has a negative relationship with wine quality, though it is weaker than that of alcohol. The main component of volatile acid is acetic acid, which causes the unpleasant vinegar taste.
We can see that the model fails to predict the good and bad quality wines. This is because most of the dataset consists of ‘average’ quality wines and there are too few observations in the extreme ranges. Our model explains only about 37.1% of the variance in quality.
With this data, my main struggle was to build a better-fitting model of the factors responsible for different wine qualities, especially the ‘Good’ and the ‘Bad’ ones. As the data was heavily concentrated around the ‘Average’ quality, my training set did not have enough data at the extremes to build a model that predicts the quality of a wine from the other variables with a smaller margin of error. In the future, a red wine dataset with more complete information would let me build my models more effectively.
When I first started developing this project, I saw that some wines didn’t have citric acid at all, and that the others showed an almost rectangular distribution. My first thought was that this might be bad or incomplete data. But when I researched wines further, I learned that citric acid is actually added to some wines to increase the acidity. So it’s understandable that some wines would not have citric acid at all, which agrees with my findings.
The other variables showed either a positively skewed or a normal distribution.
First I plotted different variables against quality to see the bivariate relationships between them, and then one by one I added one or more further factors to see if together they had any effect on quality. I saw that the factors which affected the quality of the wine the most were alcohol percentage, sulphate concentration, and acid concentration.
I tried to figure out the effect of each individual acid on the overall pH of the wine. Here I found a very peculiar phenomenon: for volatile acids, the pH increased with acidity, which was against everything I learned in my science classes.
But then to my utter surprise, for the first time in my life as a data analyst, I saw the legendary Simpson’s Paradox at play, where one lurking variable reversed the sign of the correlation and in turn flipped the trend in the opposite direction.
In the final part of my analysis, I plotted multivariate plots to see if there were some interesting combinations of variables which together affected the overall quality of the wine. It was in this section I found out that density did not play a part in improving wine quality.
For future analysis, I would love to have a dataset where, apart from the wine quality, each wine is also ranked by 5 different wine tasters, since once the human element is included, opinions vary with many different factors. By including the human element in my analysis, I would be able to add that perspective and uncover unseen factors which might result in a better or worse wine quality. Having these factors in the dataset would give a different insight altogether in my analysis.